Age Distribution of Victims

# Notice that age 0 stands for NA
data$vict_age[data$vict_age == 0] <- NA

# age distribution
ggplot(data, aes(x = vict_age)) +
  geom_histogram(binwidth = 5, fill = "#433E85FF", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Age Distribution of Victims", x = "Age", y = "Frequency") 


The first figure is a histogram showing the frequency distribution of victims across different age ranges. The x-axis represents age, ranging from 0 to 120 while the y-axis represents frequency, with values up to 100,000 or more. The distribution is approximately bell-shaped, indicating a concentration of victims in the middle age ranges (around 30-50), with fewer victims at younger (0-20) and older (70+) ages, which suggests a normal-like distribution with a slight skew towards younger ages.

Age Group Distribution of Victims

# divide age into four categories
data$age_group <- cut(
  data$vict_age,
  breaks = c(-Inf, 18, 40, 60, Inf),
  labels = c("juvenile", "Young adult", "Middle-aged people", "The elderly"),
  right = FALSE
)


ggplot(data[!is.na(data$vict_age), ], aes(x = age_group)) +
  geom_bar(fill = "#25858EFF", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Age Group Distribution of Victims",
    x = "Age Group",
    y = "Count"
  )


The second figure is a bar plot showing the count of victims across predefined age groups. The x-axis categorizes victims into four groups, Juvenile, Young adult, Middle-aged people and the elderly. The y-axis represents the count of victims in each age group, with values ranging from 0 to over 400,000. The Young adult group has the highest count, followed by Middle-aged people and The elderly. The Juvenile group has the lowest count. The distribution emphasizes the disproportionate representation of victims in the young adult and middle-aged categories.

Age Distribution By Area

# Create a boxplot showing age distribution by area
box_age_area <- ggplot(data, aes(x = area_name, y = vict_age, fill = area_name)) +
  geom_boxplot(outlier.color = "black", outlier.size = 0.5, alpha = 0.7) + 
  theme_minimal() +
  labs(
    title = "Age Distribution by Area",
    x = "Area",
    y = "Victim Age",
    fill = "Area"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none")                          

# Display the plot
ggplotly(box_age_area)


This box plot visualization provides insights into how victim age varies across different areas, showing both central tendencies (medians) and variability (IQRs and ranges). Each box plot represents the distribution of ages for victims in a specific area. The box indicates the interquartile range (IQR) — the middle 50% of data. The horizontal line inside each box marks the median age. The “whiskers” extend to show the range of the data, excluding outliers. Dots beyond the whiskers represent outliers, which are ages that fall significantly outside the typical range for that area. Overall,some areas may have younger or older median victim ages. Areas with taller boxes or whiskers have more variability in victim ages.

Crime Severity Distribution By Age Group

# Crime Severity Distribution by Age Group
data$severity_label <- ifelse(data$part_1_2 == 1, "Serious", "Less Serious")

# Significant test: the relationship between different age groups and crime severity
# turn severity into factor
data$severity_label <- as.factor(data$severity_label)

# Chi-squre test
severity_age_table <- table(data$age_group, data$severity_label)
chisq_test <- chisq.test(severity_age_table)
print(chisq_test)
## 
##  Pearson's Chi-squared test
## 
## data:  severity_age_table
## X-squared = 4762.7, df = 3, p-value < 2.2e-16
# output
if (chisq_test$p.value < 0.05) {
  print("Age group has a statistically significant relationship with crime severity.")
} else {
  print("No significant relationship between age group and crime severity.")
}
## [1] "Age group has a statistically significant relationship with crime severity."
ggplot(data[!is.na(data$vict_age), ], aes(x = age_group, fill = severity_label)) +
  geom_bar(position = "fill", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Crime Severity Distribution by Age Group",
    x = "Age Group",
    y = "Proportion",
    fill = "Crime Severity"
  )

The figure showing the proportion of crime severity categories for each age group. For juveniles, the proportion of serious crimes is relatively low (about 30%), with the remaining 70% being less serious crimes. For young adults, the proportion of serious crimes increases to aroung 55% and the remaining are less serious crimes. For middle-aged people and the elders, the pattern is similar to young adults, with slightly fewer serious crimes.

This visualization highlights how crime severity is distributed across age groups. The proportion of serious crimes tends to decrease with increasing age groups, except for the juvenile. Juveniles are involved in a lower proportion of serious crimes compared to older age groups.

Gender Distribution of Victims By Severity

# Calculate the proportion of crime severity for each gender
# Filter out rows where vict_sex is "-" or NA
clean_data <- data[!is.na(data$vict_sex) & data$vict_sex != "-", ]

# Recode gender codes with clearer labels
clean_data <- clean_data %>%
  mutate(gender_label = recode(vict_sex,
                               "F" = "Female",
                               "M" = "Male",
                               "H" = "Intersex/Other",
                               "X" = "Unknown"))

# Calculate the proportion of crime severity for each gender
severity_gender_data <- clean_data %>%
  group_by(severity_label, gender_label) %>%
  summarise(count = n(), .groups = "drop") %>%
  complete(severity_label, gender_label, fill = list(count = 0)) %>%
  group_by(gender_label) %>%
  mutate(percentage = count / sum(count) * 100)

severity_gender_table <- severity_gender_data %>%
  arrange(gender_label, desc(percentage))

# Display the table
kable(severity_gender_table, format = "html", caption = "Crime Severity by Gender and Percentage") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Crime Severity by Gender and Percentage
severity_label gender_label count percentage
Less Serious Female 197898 56.10149
Serious Female 154852 43.89851
Serious Intersex/Other 70 62.50000
Less Serious Intersex/Other 42 37.50000
Serious Male 237003 59.73325
Less Serious Male 159766 40.26675
Serious Unknown 57399 60.70050
Less Serious Unknown 37162 39.29950
# Plot the pie chart with values annotated
ggplot(severity_gender_data, aes(x = "", y = percentage, fill = severity_label)) +
  geom_bar(stat = "identity", width = 1, alpha = 0.7) +
  coord_polar(theta = "y") +
  facet_wrap(~ gender_label) +  # Use the recoded gender labels
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), size = 4) + # Add percentage labels
  theme_minimal() +
  labs(
    title = "Crime Severity Distribution by Gender",
    x = NULL,
    y = NULL,
    fill = "Crime Severity"
  ) +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

The chart provides a clear comparison of crime severity levels across gender categories, highlighting distinct patterns in crime involvement based on gender. Each pie chart represents the proportion of crime severity levels for different gender categories. Intersex/Other has the highest proportion of Serious crimes (62.5%), followed by the Unknown category (60.7%) and Males (59.7%). Females have the highest proportion of Less Serious crimes (56.1%). There is a clear pattern where males and non-binary categories (Intersex/Other, Unknown) are associated with a greater proportion of serious crimes, while females have a higher association with less serious crimes.

Racial Distribution of Victims

# Filter data to remove invalid or missing race entries
clean_data <- data %>%
  filter(!is.na(vict_descent) & vict_descent != "-")

# Map race codes to full descriptions and group small groups as "Others"
clean_data <- clean_data %>%
  mutate(
    vict_descent = recode(vict_descent,
                          "B" = "Black",       # Map "B" to "Black"
                          "H" = "Hispanic",    # Map "H" to "Hispanic"
                          "W" = "White",       # Map "W" to "White"
                          "X" = "Unknown",     # Map "X" to "Unknown"
                          "O" = "Others",      # Map "O" to "Others"
                          .default = "Others") # Group any unspecified codes as "Others"
  )

# Step 3: Calculate the proportion of each race
race_distribution <- clean_data %>%
  group_by(vict_descent) %>%
  summarise(count = n(), .groups = "drop") %>%          
  mutate(percentage = count / sum(count) * 100) %>%       
  mutate(vict_descent = ifelse(percentage < 5 | vict_descent == "Others", 
                               "Others", vict_descent)) %>% 
  # Merge small groups (<5%) into "Others"
  group_by(vict_descent) %>%
  summarise(count = sum(count), percentage = sum(percentage), .groups = "drop") 
# Recalculate totals

# Create the pie chart
ggplot(race_distribution, aes(x = "", y = percentage, fill = vict_descent)) +
  geom_bar(stat = "identity", width = 1, alpha = 0.7) + 
  coord_polar(theta = "y") +               
  theme_minimal() +                        
  labs(
    title = "Racial Distribution of Victims", 
    x = NULL,                                
    y = NULL,                                
    fill = "Race"                       
  ) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), size = 4) + 
  theme(
    axis.text = element_blank(),  
    axis.ticks = element_blank(), 
    panel.grid = element_blank()  
  )

This pie chart highlights the racial diversity of the victim population, representing the proportions of victims belonging to different racial categories. The Hispanic group forms the largest percentage of victims, comprising over a third of the total population. White and Black victims are the next most represented groups, with White victims being significantly higher than Black. Unknown and Others categories form smaller, but still notable, portions of the distribution.

Two Factors Analysis

Crime Severity Distribution by Age Group and Gender

# crime severity with age and gender
ggplot(data[!is.na(data$vict_age) & !is.na(data$vict_sex) & data$vict_sex != "H", ], aes(x = age_group, fill = severity_label)) +
  geom_bar(position = "dodge", alpha = 0.7) +
  facet_wrap(~ vict_sex) +
  theme_minimal() +
  labs(
    title = "Crime Severity Distribution by Age Group and Gender",
    x = "Age Group",
    y = "Count",
    fill = "Crime Severity"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

This grouped bar chart provides a detailed view of how crime severity differs across gender and age groups, highlighting significant trends in both dimensions.

For Females (F):

Young adults have the highest crime counts, with less serious crimes being slightly more frequent than serious crimes.

Middle-aged people also show significant crime counts, though less than young adults, and the pattern between crime severities is similar.

Crime counts for juveniles and the elderly are much lower.

For Males (M):

Young adults dominate in crime counts, with serious crimes being more frequent than less serious ones (reversing the trend seen in females).

Middle-aged people follow with high counts, though less than young adults, and serious crimes still dominate.

Crime counts for juveniles and the elderly are low, similar to females.

For Non-Binary/Other (X):

Crime counts are minimal across all age groups, but young adults and middle-aged people have slightly higher counts compared to juveniles and the elderly.

Overall, males commit more serious crimes, especially among young adults and middle-aged groups, compared to females. Non-binary/other individuals have relatively low crime counts across all groups. Young adults and middle-aged people dominate in crime counts, while juveniles and the elderly have lower crime involvement across all genders.

Crime Severity by Age Group and Race

# Crime Severity with Age Group and Race
ggplot(data[!is.na(data$vict_age) & !is.na(data$vict_descent) & data$vict_descent != "-", ], 
       aes(x = age_group, fill = severity_label)) +
  geom_bar(position = "fill", alpha = 0.7) +
  facet_wrap(~ vict_descent) +
  theme_minimal() +
  labs(title = "Crime Severity by Age Group and Race", 
       x = "Age Group", 
       y = "Proportion", 
       fill = "Crime Severity") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

This faceted stacked bar chart provides a detailed view of how crime severity varies across age groups and racial categories, allowing for analysis of demographic patterns. Young adults and middle-aged people generally dominate in terms of crime proportions across most racial categories. Juveniles and the elderly consistently have smaller proportions of crimes in all racial categories. The proportion of serious crimes tends to vary across racial groups and age groups. In some racial categories (e.g., D, S, V), serious crimes are slightly higher for young adults and middle-aged groups. In other racial categories (e.g., A, K, X), the proportions of less serious crimes (purple) dominate across all age groups.